ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R
We introduce the C++ application and R package ranger. The software is a fast
implementation of random forests for high dimensional data. Ensembles of
classification, regression and survival trees are supported. We describe the
implementation, provide examples, validate the package with a reference
implementation, and compare runtime and memory usage with other
implementations. The new software proves to scale best with the number of
features, samples, trees, and features tried for splitting. Finally, we show
that ranger is the fastest and most memory efficient implementation of random
forests to analyze data on the scale of a genome-wide association study.
Drought effects on biofuel feedstock production by Populus trichocarpa
As the world population continues to increase, so does the need for sustainable sources of fuel. Biofuels are of particular interest and could be an economically feasible fuel source given the right conditions. Populus trichocarpa is a rapidly growing plantation species that, in addition to having a fully sequenced genome available for study, displays a wide range of phenotypic traits among genotypes. By analyzing these differences in both plantation and more controlled greenhouse settings, we aimed to discover which genotypes performed best under drought conditions, and which physiological mechanisms granted them that high performance. In the field, differences in heights and stress tolerance among genotypes were observed, and 60 genotypes of differing water-limitation resistance were selected for further measures. No differences between resistance groups were seen in the physiological measures taken, yet the more resistant genotypes had higher stress tolerance indices and grew taller than susceptible genotypes from similar latitudes. The greenhouse study confirmed the water-limitation resistance rankings for 80% of the genotypes and found that resistant genotypes expressed greater midday stomatal control, enabling them to conserve water. Despite this temporary shutdown of photosynthesis, resistant genotypes assimilate carbon at a higher rate than the susceptible genotypes and can maintain their growth advantage. The quick response rate to water-limited conditions correlates with latitude and water availability of the collection site for the clones, suggesting that clones that do not regularly experience water-limitation are more sensitive to it and are able to make short-term adaptations to avoid such conditions.
Further evaluation will be needed to examine whether these short-term adaptations can maintain growth over extended periods of drought or on marginal lands, in order for these genotypes to be viable candidates for a rotational crop used for biofuel production.
Block Forests: random forests for blocks of clinical and omics covariate data
Background
In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to render complex dependency patterns between the outcome and the covariates. Against this background, we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome, we compared the prediction performances of the block forest variants with alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available.
Results
We identify one variant termed “block forest” that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the case of the multi-omics data. The degree of improvement over random survival forest varied strongly across data sets. Moreover, considering all clinical covariates mandatorily improved the performance. This result should, however, be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application.
Conclusions
The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type.
arfpy: A python package for density estimation and generative modeling with adversarial random forests
This paper introduces arfpy, a Python implementation of
Adversarial Random Forests (ARF) (Watson et al., 2023), which is a lightweight
procedure for synthesizing new data that resembles some given data. The
software equips practitioners with straightforward
functionalities for both density estimation and generative modeling. The method
is particularly useful for tabular data and its competitive performance is
demonstrated in previous literature. As a major advantage over the mostly deep
learning based alternatives, arfpy combines the method's reduced
requirements in tuning efforts and computational resources with a user-friendly
Python interface. This supplies audiences across scientific fields with
software to generate data effortlessly.
Comment: The software is available at https://github.com/bips-hb/arfp
Testing Conditional Independence in Supervised Learning Algorithms
We propose the conditional predictive impact (CPI), a consistent and unbiased
estimator of the association between one or several features and a given
outcome, conditional on a reduced feature set. Building on the knockoff
framework of Candès et al. (2018), we develop a novel testing procedure that
works in conjunction with any valid knockoff sampler, supervised learning
algorithm, and loss function. The CPI can be efficiently computed for
high-dimensional data without any sparsity constraints. We demonstrate
convergence criteria for the CPI and develop statistical inference procedures
for evaluating its magnitude, significance, and precision. These tests aid in
feature and model selection, extending traditional frequentist and Bayesian
techniques to general supervised learning tasks. The CPI may also be applied in
causal discovery to identify underlying multivariate graph structures. We test
our method using various algorithms, including linear regression, neural
networks, random forests, and support vector machines. Empirical results show
that the CPI compares favorably to alternative variable importance measures and
other nonparametric tests of conditional independence on a diverse array of
real and simulated datasets. Simulations confirm that our inference procedures
successfully control Type I error and achieve nominal coverage probability. Our
method has been implemented in an R package, cpi, which can be downloaded from
https://github.com/dswatson/cpi
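
The idea behind the CPI described in this abstract can be illustrated with a minimal sketch. This is not the API of the cpi R package. It assumes simulated data with independent standard-normal features, for which a fresh independent draw of a feature serves as a valid knockoff; correlated features would need a proper knockoff sampler. The learner (ordinary least squares), the squared-error loss, the function name cpi_sketch, and all parameter values are illustrative assumptions. The sketch fits a model, replaces one test-set feature with its knockoff, and summarizes the per-sample increase in loss with a mean, standard error, and t-statistic.

```python
import numpy as np


def cpi_sketch(j, n_train=500, n_test=500, seed=0):
    """Illustrative CPI-style estimate for feature j on simulated data.

    Assumption: features are independent standard normals, so an
    independent redraw of column j acts as its knockoff copy.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical data: only the first two features affect the outcome.
    beta = np.array([2.0, 1.0, 0.0])
    X_train = rng.standard_normal((n_train, 3))
    y_train = X_train @ beta + rng.standard_normal(n_train)
    X_test = rng.standard_normal((n_test, 3))
    y_test = X_test @ beta + rng.standard_normal(n_test)

    # Supervised learner (assumed here): ordinary least squares.
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    # Replace feature j in the test set with its knockoff.
    X_knock = X_test.copy()
    X_knock[:, j] = rng.standard_normal(n_test)

    # Per-sample squared-error loss with original and knockoff data.
    loss_orig = (y_test - X_test @ coef) ** 2
    loss_knock = (y_test - X_knock @ coef) ** 2

    # Mean loss increase, its standard error, and a paired t-statistic.
    delta = loss_knock - loss_orig
    cpi = delta.mean()
    se = delta.std(ddof=1) / np.sqrt(n_test)
    return cpi, se, cpi / se


if __name__ == "__main__":
    for j in range(3):
        cpi, se, t_stat = cpi_sketch(j)
        print(f"feature {j}: CPI={cpi:.3f}, SE={se:.3f}, t={t_stat:.2f}")
```

In this setup, an informative feature should show a clearly positive loss increase, while a feature with no effect on the outcome should show a value close to zero.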